Why Should We Care?

  • P-values and “statistical significance” are widespread in statistics and analytics, but they are controversial.
  • If you have been exposed to these ideas before, you have probably been taught a way of doing things that is now widely discredited.
  • Addressing these issues is difficult because the consensus among experts hasn’t filtered through to how these methods are taught.

The Dreaded P-Value

What is all the fuss about?

What is a P-Value?

  • A p-value is “the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value” (Wasserstein and Lazar 2016).
  • This definition is technically correct, but what does it really mean? This question is the source of great debate (Schervish 1996; Aschwanden 2015; Gelman 2016; Greenland et al. 2016; Wasserstein and Lazar 2016)!
  • Put simply (but not technically correctly), p-values are a measure of how surprising the observed data would be if the null hypothesis (that no effect exists in the population) is true.
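To make the “how surprising” intuition concrete, here is a minimal sketch (not from the original slides) that computes a two-sided p-value by permutation: shuffle the group labels and ask how often the shuffled mean difference is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(42)

# two illustrative groups with no true difference
a = rng.normal(loc=0.0, size=100)
b = rng.normal(loc=0.0, size=100)
observed = a.mean() - b.mean()

# null distribution: shuffle the group labels many times
pooled = np.concatenate([a, b])
diffs = np.empty(5000)
for i in range(5000):
    rng.shuffle(pooled)
    diffs[i] = pooled[:100].mean() - pooled[100:].mean()

# p-value: proportion of shuffled differences at least as extreme
p_value = np.mean(np.abs(diffs) >= np.abs(observed))
```

Because the null hypothesis is true here by construction, the p-value is simply telling us where the observed difference falls among differences produced by chance alone.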

Regression Example

import numpy as np
import pandas as pd
import statsmodels.api as sm

# set random seed
np.random.seed(42)

# sample size
N = 200

# generate predictors + noise
x1, x2, x3, noise = [np.random.normal(size=N) for _ in range(4)]

# outcome with specified effects
y = 1 + 0.1 * x1 + 0.5 * x2 + 0.01 * x3 + noise

# create dataframe
df = pd.DataFrame(dict(y=y, x1=x1, x2=x2, x3=x3))

# fit ols model
X = sm.add_constant(df[['x1', 'x2', 'x3']])
lm_results = sm.OLS(df.y, X).fit()

# render regression table
create_regression_table(lm_results)
Table 1: Linear Regression Results (N = 200)
Outcome = y

            Estimate   95% CI           Std. Error   t-statistic   p-value
Intercept   1.03***    [0.89, 1.18]     0.07         14.29         0.000
x1          0.20*      [0.05, 0.35]     0.08         2.55          0.011
x2          0.39***    [0.25, 0.53]     0.07         5.34          0.000
x3          0.13       [-0.02, 0.27]    0.07         1.75          0.081

* p<0.05; ** p<0.01; *** p<0.001
R² = 0.167; Adj. R² = 0.155

It is All This Guy’s Fault

Ronald A. Fisher

And a Little Bit These Guys

Jerzy Neyman

Egon Pearson

And These Two Didn’t Help

Pierre Simon Laplace

Karl Pearson

A Brief History of P-Values

  • Early uses of significance testing date to the 1700s (Laplace, Arbuthnot), with the modern form emerging in the early 1900s (Pearson, Gosset).
  • Ronald A. Fisher formalized the p-value in Statistical Methods for Research Workers (Fisher 1936), suggesting p = 0.05 as a convenient cutoff.
  • Neyman and Pearson (1933a; 1933b) introduced formal hypothesis testing with a pre-chosen significance level \(\alpha\) (typically 0.05).
  • Fisher later regretted the use of a rigid threshold like 0.05, but it remains ingrained in practice.
  • Takeaway: p-values grew from Fisher’s flexible idea into an oft-rigid threshold, setting the stage for current debates.

Common Misinterpretations & Misuses

  • P-values and “statistical significance” are often misused. Sometimes deliberately.
  • They are often used as proof the effect is real or substantively important.
  • They are often treated as a measure of the strength of the effect, and as the most important piece of evidence in an analysis.
  • We rarely acknowledge that p-values vary from sample to sample and that multiple comparisons can lead to false positives.
  • P-hacking and other such practices can deliberately mislead.
  • Statistical significance encourages a flawed way of thinking.
  • And many, many more (Greenland et al. 2016).

Visualising the Problem

Using simulations to demonstrate issues with p-values

Simulate P-Value Distributions

  • Simulating 1,000 t-tests each for a zero effect (null) and a 0.2 effect (alternative), at a range of increasing sample sizes.
  • How should we expect p-values for the null and alt to be distributed? And how will sample size influence the distribution?
np.random.seed(42) # set random seed
n_sims = 1000 # number of simulations
effects = [0.0, 0.2] # simulated effect sizes
sample_sizes = [10, 20, 50, 100, 250, 500, 1000] # simulated sample sizes

# simulate p-values
p_values = simulate_pvals(effects, sample_sizes, n_sims)

P-Value Distributions (n=50)

fig, ax = plt.subplots(figsize=(8, 6))

sns.histplot(
    data=p_values.filter(pl.col("N") == 50).to_pandas(),
    x="p", hue="effect", palette=colors,
    alpha=0.7,  bins=20, edgecolor=".3"
)

ax.axvline(x=0.05, color='#D93649', linestyle='--', linewidth=5)

ax.get_legend().set_title("Effect Size")
plt.ylabel("Simulations")
plt.xlabel("P-Value")

plt.xlim(0, 1)
plt.tight_layout()
plt.show()

report_metrics(p_values, n=50)
Figure 1: n = 50 per group, 1000 simulations
Simulation Results:
- Sample Size Per Group: 50
- Simulations Per Condition: 1000
- False Positive Rate (H₀): 0.05 (Expected: 0.05)
- Statistical Power (H₁): 0.18 (Target: 0.8)
- Mean p-value (H₀ vs H₁): 0.50 vs 0.35

P-Value Distributions (n=100)

fig, ax = plt.subplots(figsize=(8, 6))

sns.histplot(
    data=p_values.filter(pl.col("N") == 100).to_pandas(),
    x="p", hue="effect", palette=colors,
    alpha=0.8,  bins=20, edgecolor=".3"
)

ax.axvline(x=0.05, color='#D93649', linestyle='--', linewidth=5)

ax.get_legend().set_title("Effect Size")
plt.ylabel("Simulations")
plt.xlabel("P-Value")

plt.xlim(0, 1)
plt.tight_layout()
plt.show()

report_metrics(p_values, n=100)
Figure 2: n = 100 per group, 1000 simulations
Simulation Results:
- Sample Size Per Group: 100
- Simulations Per Condition: 1000
- False Positive Rate (H₀): 0.05 (Expected: 0.05)
- Statistical Power (H₁): 0.30 (Target: 0.8)
- Mean p-value (H₀ vs H₁): 0.50 vs 0.29

P-Value Distributions (n=500)

fig, ax = plt.subplots(figsize=(8, 6))

sns.histplot(
    data=p_values.filter(pl.col("N") == 500).to_pandas(),
    x="p", hue="effect", palette=colors,
    alpha=0.7,  bins=20, edgecolor="black"
)

plt.axvline(x=0.05, color='#D93649', linestyle='--', linewidth=5)

ax.get_legend().set_title("Effect Size")
plt.ylabel("Simulations")
plt.xlabel("P-Value")

plt.xlim(0, 1)
plt.tight_layout()
plt.show()

report_metrics(p_values, n=500)
Figure 3: n = 500 per group, 1000 simulations
Simulation Results:
- Sample Size Per Group: 500
- Simulations Per Condition: 1000
- False Positive Rate (H₀): 0.05 (Expected: 0.05)
- Statistical Power (H₁): 0.89 (Target: 0.8)
- Mean p-value (H₀ vs H₁): 0.51 vs 0.03

Simulate Multiple Comparisons

  • Simulating running 50 tests where there is zero effect.
  • Running the simulation 1,000 times.
  • How often would we observe a “statistically significant” effect in at least one test?
np.random.seed(42) # set random seed
n_sims = 1000 # number of simulations
n_tests = 50 # number of tests per simulation
alpha = 0.05 # significance threshold

# simulate multiple comparisons
multiple_comparisons = simulate_multiple_comparisons(n_tests, n_sims, alpha)

Multiple Comparison Problems

expected = alpha * n_tests
fig, ax = plt.subplots(figsize=(8, 6))

sns.histplot(
    data=multiple_comparisons.to_pandas(), x="false_positives",
    bins=range(0, 12), color="#005EB8", edgecolor="black"
)

plt.axvline(
    x=expected, color='#D93649', linestyle='--', linewidth=5,
    label=f"Expected ({n_tests} × {alpha} = {expected})"
)

plt.xlabel("False Positives (p < 0.05)")
plt.xticks(range(0, 11, 1))
plt.legend()

plt.tight_layout()
plt.show()

report_multiple_comparisons(multiple_comparisons, n_tests, n_sims, alpha)
Figure 4: 1000 simulations × 50 tests, no true effects
Simulation Results:
- Average number of false positives: 2.51
- Probability of at least one false positive: 92.8%
- Expected number of false positives: 2.5
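The simulated numbers line up with what independence would predict; a quick check of the arithmetic:

```python
# expected false positives across k independent tests at level alpha
alpha, k = 0.05, 50
expected_fp = alpha * k                  # long-run average count
p_at_least_one = 1 - (1 - alpha) ** k    # chance of >= 1 false positive
```

This gives an expected 2.5 false positives and roughly a 92% chance of at least one, close to the simulated 2.51 and 92.8%.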

Solutions to a Significant Problem

Recommendations for using p-values

Focus on Measuring Effect Sizes

  • Any analysis should focus on the magnitude and direction of the effect above all else.
  • The effect estimates are the part of the model that matters, and everything else, including p-values, is there to help measure model fit and identify potential issues.
  • This approach is rooted in the idea that the size of the effect is the bit that is directly grounded in the model’s context.
  • A p-value doesn’t tell us anything about the outcome we are studying. The effect size does.

Diagnostic (or Descriptive) P-Values

  • P-values do not tell you whether the results of your analysis are real, correct, or substantively important. But I don’t think p-values should be dismissed entirely.
  • Instead, we should use p-values as a tool for checking if we have enough data to observe the effect we are studying.
    • We are capable of asking questions and designing analyses where the effect will not be zero (Gelman, Hill, and Vehtari 2021).
    • If we start by assuming there is an effect, p-values tell us whether we have enough data to observe that effect.
  • Additionally, I think there is still value in using p-values as a quick check of whether two groups differ, or for similar descriptive questions, but p-values themselves are not a causal measure.
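Treating the p-value as a data-sufficiency check connects naturally to power analysis; a sketch using statsmodels’ power tools, matching the 0.2 effect simulated earlier:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# per-group n needed to detect d = 0.2 with 80% power at alpha = 0.05
n_required = analysis.solve_power(effect_size=0.2, power=0.8, alpha=0.05)

# power actually achieved with 500 per group (cf. the n = 500 simulation)
power_at_500 = analysis.power(effect_size=0.2, nobs1=500, ratio=1.0, alpha=0.05)
```

This recovers roughly 393 observations per group for 80% power, and close to the 0.89 power the simulation found at n = 500.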

Same Effect, Different Sample Sizes

summary = (
    p_values
    .filter(pl.col("effect") == "0.2")
    .with_columns((pl.col("p") > 0.05).alias("non_sig"))
    .group_by("N")
    .agg([
        pl.col("non_sig").sum().alias("non_sig_count"),
        pl.len().alias("total")
    ])
    .with_columns(
        (pl.col("non_sig_count") /
        pl.col("total")).alias("non_sig_rate")  # proportion failing to reject
    )
    .select(["N", "non_sig_count", "non_sig_rate"])
    .sort("N")
)

format_power_table(summary)
Table 2: Proportion of Non-Significant Results (p > 0.05)
Sample Size  Non-Significant  Proportion
10 925 0.93
20 908 0.91
50 817 0.82
100 701 0.70
250 392 0.39
500 113 0.11
1000 5 0.01
Effect = 0.2, 1000 Simulations

Scrap Statistical Significance

  • While I am willing to make the case for p-values, I won’t say the same for statistical significance.
  • Using a cutoff point for deciding whether an analysis is useful or not is bad practice.
  • “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant” (Gelman and Stern 2006).

Wrapping Up

Summarising what we’ve learned

Key Takeaways

  • P-values are a useful tool, but we need to give them significantly less weight in the analysis process.
  • We should scrap statistical significance and focus on measuring effect sizes.
  • P-values and statistical significance were only ever a cheap shortcut. Doing science (or analytics) well is hard, and we shouldn’t take shortcuts.

Further Reading

Additional Resources

Thank You!

Contact:

Code & Slides:

References

Aschwanden, Christie. 2015. “Not Even Scientists Can Easily Explain p-Values.” FiveThirtyEight.com.
Fisher, Ronald Aylmer. 1936. “Statistical Methods for Research Workers.”
Gelman, Andrew. 2016. “The Problems with p-Values Are Not Just with p-Values.” The American Statistician 70 (10): 1–2.
Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2021. Regression and Other Stories. Cambridge University Press.
Gelman, Andrew, and Hal Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” The American Statistician 60 (4): 328–31.
Greenland, Sander, Stephen J Senn, Kenneth J Rothman, John B Carlin, Charles Poole, Steven N Goodman, and Douglas G Altman. 2016. “Statistical Tests, p Values, Confidence Intervals, and Power: A Guide to Misinterpretations.” European Journal of Epidemiology 31 (4): 337–50.
Neyman, Jerzy, and Egon S Pearson. 1933a. “The Testing of Statistical Hypotheses in Relation to Probabilities a Priori.” In Mathematical Proceedings of the Cambridge Philosophical Society, 29:492–510. 4. Cambridge University Press.
Neyman, Jerzy, and Egon Sharpe Pearson. 1933b. “On the Problem of the Most Efficient Tests of Statistical Hypotheses.” Philosophical Transactions of the Royal Society of London 231 (694-706): 289–337.
Schervish, Mark J. 1996. “P Values: What They Are and What They Are Not.” The American Statistician 50 (3): 203–6.
Wasserstein, Ronald L, and Nicole A Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician. Taylor & Francis.